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Abstract 

We present the Bayesian Case Model (BCM), a general framework for Bayesian 
case-based reasoning (CBR) and prototype classification and clustering. BCM 
brings the intuitive power of CBR to a Bayesian generative framework. The BCM 
learns prototypes , the “quintessential” observations that best represent clusters in 
a dataset, by performing joint inference on cluster labels, prototypes and impor¬ 
tant features. Simultaneously, BCM pursues sparsity by learning subspaces , the 
sets of features that play important roles in the characterization of the prototypes. 
The prototype and subspace representation provides quantitative benefits in inter- 
pretability while preserving classification accuracy. Human subject experiments 
verify statistically significant improvements to participants’ understanding when 
using explanations produced by BCM, compared to those given by prior art. 


1 Introduction 

People like to look at examples. Through advertising, marketers present examples of people we 
might want to emulate in order to lure us into making a purchase. We might ignore recommendations 
made by Amazon.com and look instead at an Amazon customer’s Listmania to find an example of a 
customer like us. We might ignore medical guidelines computed from a large number of patients in 
favor of medical blogs where we can get examples of individual patients’ experiences. 

Numerous studies have demonstrated that exemplar-based reasoning, involving various forms of 
matching and prototyping, is fundamental to our most effective strategies for tactical decision¬ 
making (|26,0 ED). For example, naturalistic studies have shown that skilled decision makers 
in the fire service use recognition-primed decision making , in which new situations are matched to 
typical cases where certain actions are appropriate and usually successful |[2TI . To assist humans in 
leveraging large data sources to make better decisions, we desire that machine learning algorithms 
provide output in forms that are easily incorporated into the human decision-making process. 

Studies of human decision-making and cognition provided the key inspiration for artificial intelli¬ 
gence Case-Based Reasoning (CBR) approaches l2l[28l. CBR relies on the idea that a new situation 
can be well-represented by the summarized experience of previously solved problems lf28fl . CBR 
has been used in important real-world applications 124, 4), but is fundamentally limited, in that it 
does not learn the underlying complex structure of data in an unsupervised fashion and may not 
scale to datasets with high-dimensional feature spaces (as discussed in (29)). 

In this work, we introduce a new Bayesian model, called the Bayesian Case Model (BCM), for 
prototype clustering and subspace learning. In this model, the prototype is the exemplar that is most 
representative of the cluster. The subspace representation is a powerful output of the model because 
we neither need nor want the best exemplar to be similar to the current situation in all possible ways: 
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for instance, a moviegoer who likes the same horror films as we do might be useful for identifying 
good horror films, regardless of their cartoon preferences. We model the underlying data using a 
mixture model, and infer sets of features that are important within each cluster (i.e., subspace). This 
type of model can help to bridge the gap between machine learning methods and humans, who use 
examples as a fundamental part of their decision-making strategies. 

We show that BCM produces prediction accuracy comparable to or better than prior art for standard 
datasets. We also verify through human subject experiments that the prototypes and subspaces 
present as meaningful feedback for the characterization of important aspects of a dataset. In these 
experiments, the exemplar-based output of BCM resulted in statistically significant improvements 
to participants’ performance of a task requiring an understanding of clusters within a dataset, as 
compared to outputs produced by prior art. 

2 Background and Related Work 

People organize and interpret information through exemplar-based reasoning, particularly when they 
are solving problems ([26 ( 7,[9, 21]). AI Cased-Based Reasoning approaches are motivated by 
this insight, and provide example cases along with the machine-learned solution. Studies show 
that example cases significantly improve user confidence in the resulting solutions, as compared to 
providing the solution alone or by also displaying a rule that was used to find the solution liTTl . 
However, CBR requires solutions (i.e. labels) for previous cases, and does not learn the underlying 
structure of the data in an unsupervised fashion. Maintaining transparency in complex situations 
also remains a challenge lf29ll . CBR models designed explicitly to produce explanations (T| rely on 
the backward chaining of the causal relation from a solution, which does not scale as complexity 
increases. The cognitive load of the user also increases with the complexity of the similarity measure 
used for comparing cases H4l . Other CBR models for explanations require the model to be manually 
crafted in advance by experts f25| . 

Alternatively, the mixture model is a powerful tool for discovering cluster distributions in an un¬ 
supervised fashion. However, this approach does not provide intuitive explanations for the learned 
clusters (as pointed out in 0). Sparse topic models are designed to improve interpretability by re¬ 
ducing the number of words per topic I1321 1T31. However, using the number of features as a proxy for 
interpretability is problematic, as sparsity is often not a good or complete measure of interpretability 
0. Explanations produced by mixture models are typically presented as distributions over fea¬ 
tures. Even users with technical expertise in machine learning may have a difficult time interpreting 
such output, especially when the cluster is distributed over a large number of features lfT4l . 

Our approach, the Bayesian Case Model (BCM), simultaneously performs unsupervised clustering 
and learns both the most representative cases (i.e., prototypes) and important features (i.e., sub¬ 
spaces). BCM preserves the power of CBR in generating interpretable output, where interpretability 
comes not only from sparsity but from the prototype exemplars. 

In our view, there are at least three widely known types of interpretable models: sparse linear 
classifiers (130, 8] El]); discretization methods, such as decision trees and decision lists (e.g., 
C2E2HI31I23HI51); and prototype- or case-based classifiers (e.g., nearest neighbors not or a super¬ 
vised optimization-based method 0). (See d for a review of interpretable classification.) BCM is 
intended as the third model type, but uses unsupervised generative mechanisms to explain clusters, 
rather than supervised approaches Ifl6l or by focusing myopically on neighboring points 0. 

3 The Bayesian Case Model 

Intuitively, BCM generates each observation using the important pieces of related prototypes. The 
model might generate a movie profile made of the horror movies from a quintessential horror movie 
watcher, and action movies from a quintessential action moviegoer. 

BCM begins with a standard discrete mixture model mm to represent the underlying structure 
of the observations. It augments the standard mixture mod el w ith prototypes and subspace feature 
indicators that characterize the clusters. We show in Section p~2] that prototypes and subspace feature 
indicators improve human interpretability as compared to the standard mixture model output. The 
graphical model for BCM is depicted in Figure [l] 
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Figure 1: Graphical model for the Bayesian Case Model 


We start with N observations, denoted by x = {xi, X 2 ,..., xjy}, with each Xi represented as a ran¬ 
dom mixture over clusters. There are S clusters, where S is assumed to be known in advance. (This 
assumption can easily be relaxed through extension to a non-parametric mixture model.) Vector 7 ^ 
are the mixture weights over these clusters for the i th observation x i9 7^ G M+. Each observation 
has P features, and we denote the j th feature of the i th observation as Xij. Each feature j of the 
observation xi comes from one of the clusters, the index of the cluster for x^- is denoted by Zij and 
the full set of cluster assignments for observation-feature pairs is denoted by z. Each takes on the 
value of a cluster index between 1 and S. Hyperparameters q, A, c, and a are assumed to be fixed. 

The explanatory power of BCM results from how the clusters are characterized. While a standard 
mixture model assumes that each cluster take the form of a predefined parametric distribution (e.g., 
normal), BCM characterizes each cluster by a prototype , p S9 and a sub space feature indicator , uj s . 
Intuitively, the subspace feature indicator selects only a few features that play an important role in 
identifying the cluster and prototype (hence, BCM clusters are sub space clusters). We intuitively 
define these latent variables below. 

Prototype, p s : The prototype p s for cluster s is defined as one observation in x that maximizes 
p(p s |cc s , z, x), with the probability density and uj s as defined below. Our notation for element j of 
p s is p s j. Since p s is a prototype, it is equal to one of the observations, so p s j = for some i. 
Note that more than one maximum may exist per cluster; in this case, one prototype is arbitrarily 
chosen. Intuitively, the prototype is the “quintessential” observation that best represents the cluster. 

Subspace feature indicator uj s : Intuitively, uj s ‘turns on’ the features that are important for charac¬ 
terizing cluster 5 and selecting the prototype, p s . Here, uj s G {0,1} P is an indicator variable that 
is 1 on the subset of features that maximizes p(uj s \p s , z, x), with the probability for uj s as defined 
below. Here, uj s is a binary vector of size P, where each element is an indicator of whether or not 
feature j belongs to subspace 8. 

The generative process for BCM is as follows: First, we generate the subspace clusters. A sub¬ 
space cluster can be fully described by three components: 1) a prototype, p s , generated by sampling 
uniformly over all observations, 1... TV; 2) a feature indicator vector, 00 s , that indicates important 
features for that subspace cluster, where each element of the feature indicator (uo s j) is generated 
according to a Bernoulli distribution with hyperparameter q\ and 3) the distribution of feature out¬ 
comes for each feature, for subspace s, which we now describe. 

Distribution of feature outcomes (j) s f° r cluster s : Here, f s is a data structure wherein each “row” 
fsj is a discrete probability distribution of possible outcomes for feature j. Explicitly, f s j is a vector 
of length Vj, where Vj is the number of possible outcomes of feature j. Let us define 0 as a vector 
of the possible outcomes of feature j (e.g., for feature ‘color’, 0 = [red, blue, yellow]), where 0 V 
represents a particular outcome for that feature (e.g., 0 V = blue). We will generate f s so that it 
mostly takes outcomes from the prototype p s for the important dimensions of the cluster. We do this 
by considering the vector g , indexed by possible outcomes v, as follows: 

9p s j,u s j,\{ v ) = A(1 + c\\ Wsj = 1 p sj = © v ] ), 

where c and A are constant hyperparameters that indicate how much we will copy the prototype in 
order to generate the observations. The distribution of feature outcomes will be determined by g 
through f s j ^ Dirichlet(g Psi ^ SJ ,A) • To explain at an intuitive level: First, consider the irrelevant 
dimensions j in subspace s, which have w s j = 0. In that case, f S j will look like a uniform distribu- 
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tion over all possible outcomes for features j ; the feature values for the unimportant dimensions are 
generated arbitrarily according to the prior. Next, consider relevant dimensions where w s j = 1. In 
this case, c/) s j will generally take on a larger value A + c for the feature value that prototype p s has on 
feature j, which is called @ v . All of the other possible outcomes are taken with lower probability A. 
As a result, we will be more likely to select the outcome © v that agrees with the prototype p s . In the 
extreme case where c is very large, we can copy the cluster’s prototype directly within the cluster’s 
relevant subspace and assign the rest of the feature values randomly. 

An observation is then a mix of different prototypes, wherein we take the most important pieces of 
each prototype. To do this, mixture weights 7 r* are generated according to a Dirichlet distribution, 
parameterized by hyperparameter a. From there, to select a cluster and obtain the cluster index Zij 
for each Xij, we sample from a multinomial distribution with parameters 7^. Finally, each feature for 
an observation, x^, is sampled from the feature distribution of the assigned subspace cluster 
(Note that Latent Dirichlet Allocation (LDA) (6) also begins with a standard mixture model, though 
our feature values exist in a discrete set that is not necessarily binary.) Here is the full model, with 
hyperparameters c, A, q, and a: 

uo s j ~ Bernoulli(g) Vs,j p s ~ Uniform(1, N) Vs 

4> sj ~ Dirichlet(g Psi :Usj ,a) Vs, j where g Paj , U3j ,\(v) = A(1 + cl Ki=landpsi=e „]) 

7 Ti ^ Dirichlet(cy) Vi ~ Multinomial(71$) Vi, j ~ Multinomial (<fi z . ,y) V/.j. 

Our model can be readily extended to different similarity measures, such as standard kernel methods 
or domain specific similarity measures, by modifying the function g. For example, we can use the 
least squares loss i.e., for fixed threshold e, g Psj , Uaj ,\( v ) = A(1 + cty Waj= i wA ( Psj _ Sv y< e] )-, or, 
more generally, g P3j , Usj ,\(v) = A(1 + cl 

[w S j= 1 and £(p s j ,O v )<e]) • 

In terms of setting hyperparameters, there are natural settings for a (all entries being 1). This 
means that there are three real-valued parameters to set, which can be done through cross-validation, 
another layer of hierarchy with more diffuse hyperparameters, or plain intuition. To use BCM for 
classification, vector 7 ^ is used as S features for a classifier, such as SVM. 

3.1 Motivating example 

This section provides an illustrative example for prototypes, subspace feature indicators and sub¬ 
space clusters, using a dataset composed of a mixture of smiley faces. The feature set for a smiley 
face is composed of types, shapes and colors of eyes and mouths. For the purpose of this example, 
assume that the ground truth is that there are three clusters, each of which has two features that are 
important for defining that cluster. In Table [TJ we show the first cluster, with a subspace defined by 
the color (green) and shape (square) of the face; the rest of the features are not important for defining 
the cluster. For the second cluster, color (orange) and eye shape define the subspace. We generated 
240 smiley faces from BCM’s prior with a = 0.1 for all entries, and q = 0.5, A = 1 and c = 50. 
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Table 1: The mixture of smiley faces for LDA and BCM 
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BCM works differently to Latent Dirichlet Allocation (LDA) ED, which presents its output in a very 
different form. Table[l]depicts the representation of clusters in both LDA (middle column) and BCM 
(right column). This dataset is particularly simple, and we chose this comparison because the two 
most important features that both LDA and BCM learn are identical for each cluster. However, LDA 
does not learn prototypes, and represents information differently. To convey cluster information 
using LDA (i.e., to define a topic), we must record several probability distributions - one for each 
feature. For BCM, we need only to record a prototype (e.g., the green face depicted in the top row, 
right column of the figure), and state which features were important for that cluster’s subspace (e.g., 
shape and color). For this reason, BCM is more succinct than LDA with regard to what information 
must be recorded in order to define the clusters. One could define a “special” constrained version 
of LDA with topics having uniform weights over a subset of features, and with “word” distributions 
centered around a particular value. This would require a similar amount of memory; however, it loses 
information, with respect to the fact that BCM carries a full prototype within it for each cluster. 

A major benefit of BCM over LDA is that the “words” in each topic (the choice of feature values) are 
coupled and not assumed to be independent - correlations can be controlled depending on the choice 
of parameters. The independence assumption of LDA can be very strong, and this may be crippling 
for its use in many important applications. Given our example of images, one could easily generate 
an image with eyes and a nose that cannot physically occur on a single person (perhaps overlapping). 
BCM can also generate this image, but it would be unlikely, as the model would generally prefer to 
copy the important features from a prototype. 


BCM performs joint inference on prototypes, subspace feature indicators and cluster labels for ob¬ 
servations. This encourages the inference step to achieve solutions where clusters are better repre¬ 
sented by prototypes. We will show that this is beneficial in terms of predictive accuracy in Sec¬ 
tion [4T] We will also show through an experiment involving human subjects that BCM’s succinct 


representation is very effective for communicating the characteristics of clusters in Section 4.2 


3.2 Inference: collapsed Gibbs sampling 


We use collapsed Gibbs sampling to perform inference, as this has been observed to converge 
quickly, particularly in mixture models UTI . We sample u S j 9 Zij , and p S9 where (j) and 7 r are in¬ 
tegrated out. Note that we can recover 0 by simply counting the number of feature values assigned 
to each subspace. Integrating out and 7 r results in the following expression for sampling Z{j : 


p(zij = s\zi^j,x.,p,u,a,\) oc 




a + n 


g(Psj,u s j, A) + 
E s 9(Psj^sj,\) T 


( 1 ) 


where n^ Si ij iV ) = t {zij = s,Xij = v). In other words, if x i3 takes feature value v for feature j 
and is assigned to cluster s , then = 1, or 0 otherwise. Notation ri( Sl .j, v ) is the number of 

times that the j th feature of an observation takes feature value v and that observation is assigned to 
subspace cluster s (i.e., nf 3 ,.,j iV ) = ^ ^( z ij = s i x ij = v ))- Notation means sum over 

i and v. We use to denote a count that does not include the feature j. The derivation is 

similar to the standard collapsed Gibbs sampling for LDA mixture models ED- 


Similarly, integrating out <j> results in the following expression for sampling co S j: 

q x 

p(i 0 sj = b\q,p S j, A, (j), x, z, a) oc < 


b = 1 


B( g(psj j 1 j A)) 

, „ w B(p(p aj ,0,A) +n (Si .j i . ) ) u _ n 1 ' 

1 q x , - — D — u . 


B{g(p sj ,0,X)) 

where B is the Beta function and comes from integrating out <p variables, which are sampled from 
Dirichlet distributions. 


4 Results 

In this section, we show that BCM produces prediction accuracy comparable to or better than LDA 
for standard datasets. We also verify the interpretability of BCM through human subject experiments 
involving a task that requires an understanding of clusters within a dataset. We show statistically 
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Figure 2: Prediction test accuracy reported for the Handwritten Digit ca and 20 Newsgroups 
datasets 1221 (a) applies SVM for both LDA and BCM, (b) presents the unsupervised accuracy 
of BCM for Handwritten Digit (top) and 20 Newsgroups (bottom) and (c) depicts the sensitivity 
analysis conducted for hyperparameters for Handwritten Digit dataset. Datasets were produced by 
randomly sampling 10 to 70 observations of each digit for the Handwritten Digit dataset, and 100- 
450 documents per document class for the 20 Newsgroups dataset. The Handwritten Digit pixel 
values (range from 0 to 255) were rescaled into seven bins (range from 0 to 6). Each 16-by-16 pixel 
picture was represented as a ID vector of pixel values, with a length of 256. Both BCM and LDA 
were randomly initialized with the same seed (one half of the labels were incorrect and randomly 
mixed), The number of iterations was set at 1,000. S = 4 for 20 Newsgroups and S = 10 for 
Handwritten Digit, a = 0.01, A = 1, c = 50, q = 0.8. 


significant improvements in objective measures of task performance using prototypes produced by 
BCM, compared to output of LDA. Finally, we visually illustrate that the learned prototypes and sub¬ 
spaces present as meaningful feedback for the characterization of important aspects of the dataset. 

4.1 BCM maintains prediction accuracy. 

We show that BCM output produces prediction accuracy comparable to or better than LDA, which 
uses the same mixture model (Section [5]) to learn the underlying structure but does not learn ex¬ 
planations (i.e., prototypes and subspaces). We validate this through use of two standard datasets: 
Handwritten Digit ED and 20 Newsgroups 1221 We use the implementation of LDA available from 
|27|, which incorporates Gibbs sampling, the same inference technique used for BCM. 

Figure [2a| depicts the ratio of correctly assigned cluster labels for BCM and LDA. In order to com¬ 
pare the prediction accuracy with LDA, the learned cluster labels are provided as features to a sup¬ 
port vector machine (SVM) with linear kernel, as is often done in the LDA literature on cluster¬ 
ing 0. The improved accuracy of BCM over LDA, as depicted in the figures, is explained in part 
by the ability of BCM to capture dependencies among features via prototypes, as described in Sec¬ 
tion [3] We also note that prediction accuracy when using the full 20 Newsgroups dataset acquired 
by LDA (accuracy: 0.68± 0.01) matches that reported previously for this dataset when using a com¬ 
bined LDA and SVM approach [33|. Also, LDA accuracy for the full Handwritten Digit dataset 
(accuracy: 0.76 d= 0.017) is comparable to that produced by BCM using the subsampled dataset (70 
samples per digit, accuracy: 0.77 d= 0.03). 

As indicated by Figure |2bJ BCM achieves high unsupervised clustering accuracy as a function of 
iterations. We can compute this measure for BCM because each cluster is characterized by a pro¬ 
totype - a particular data point with a label in the given datasets. (Note that this is not possible for 
LDA.) We set a to prefer each 7^ to be sparse, so only one prototype generates each observation, 
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Descriptions of unnamed categories 


New dish 

salt 

water 

vanilla 

sugar 

peach 

flour 

egg 

cream cheese 
sour cream 
whipped cream 
baking powder 
Category 1 
Category 2 
Category 3 
Category 4 


Categories 


brownie 

cheesecake 

chili 

chutney 

cornbread 

crepe 

curry 

gingerbread 

meatloaf 

pancake 

pasta 

pizza 

potato 

salad 

punch 

quiche 

salsa 


Representative examples of categories 


Category 1 

Category 2 

Category 3 

Category 4 

Example 

Example 

Example 

Example 

powder sugar 

vinegar 

chili powder 

vinegar 

cracker 

apple 

broth 

potato 

sour cream 

cinnamon 

cumin 

bell pepper 

cream cheese 

cranberry 

garlic 

carrot 

egg 

ginger 

onion 

hot sauce 

pie filling 

raisin 

pepper 

olive 

sugar 

sugar 

tomato juice 

onion 

vanilla 


tomato paste 

sugar 

butter 


oil 

oil 



turkey 

chili powder 



black bean 

corn 



corn 

salt 


Figure 3: Web-interface for the human subject experiment 


and we use that prototype’s label for the observation. Sensitivity analysis in Figure |2c| indicates that 
the additional parameters introduced to learn prototypes and subspaces (i.e., g, A and c) are not too 
sensitive within the range of reasonable choices. 


4.2 Verifying the interpretability of BCM 

We verified the interpretability of BCM by performing human subject experiments that incorporated 
a task requiring an understanding of clusters within a dataset. This task required each participant 
to assign 16 recipes, described only by a set of required ingredients (recipe names and instructions 
were withheld), to one cluster representation out of a set of four to six. (This approach is similar 
to those used in prior work to measure comprehensibility [20].) We chose a recipe datase^for this 
task because such a dataset requires clusters to be well-explained in order for subjects to be able to 
perform classification, but does not require special expertise or training. 

Our experiment incorporated a within-subjects design, which allowed for more powerful statistical 
testing and mitigated the effects of inter-participant variability. To account for possible learning 
effects, we blocked the BCM and LDA questions and balanced the assignment of participants into 
the two ordering groups: Half of the subjects were presented with all eight BCM questions first, 
while the other half first saw the eight LDA questions. Twenty-four participants (10 females, 14 
males, average age 27 years) performed the task, answering a total of 384 questions. Subjects were 
encouraged to answer the questions as quickly and accurately as possible, but were instructed to take 
a 5-second break every four questions in order to mitigate the potential effects of fatigue. 

Cluster representations (i.e., explanations) from LDA were presented as the set of top ingredients 
for each recipe topic cluster. For BCM we presented the ingredients of the prototype without the 
name of the recipe and without subspaces. The number of top ingredients shown for LDA was set as 
the number of ingredients from the corresponding BCM prototype and ran Gibbs sampling for LDA 
with different initializations until the ground truth clusters were visually identifiable. 

Using explanations from BCM, the average classification accuracy was 85.9%, which was statisti¬ 
cally significantly higher (c 2 (l, TV = 24) = 12.15 ,p <C 0.001) than that of LDA, (71.3%). For 
both LDA and BCM, each ground truth label was manually coded by two domain experts: the first 
author and one independent analyst (kappa coefficient: 1). These manually-produced ground truth 
labels were identical to those that LDA and BCM predicted for each recipe. There was no statisti¬ 
cally significant difference between BCM and LDA in the amount of time spent on each question 
(£(24) = 0.89, p = 0.37); the overall average was 32 seconds per question, with 3% more time spent 
on BCM than on LDA. Subjective evaluation using Likert-style questionnaires produced no statisti¬ 
cally significant differences between reported preferences for LDA versus BCM. Interestingly, this 
suggests that participants did not have insight into their superior performance using output from 
BCM versus that from LDA. 


Computer Cooking Contest: http://liris. cnrs .fr/ccc/ccc2014/ 
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Subspace A illllii 
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Iteration 

(a) Handwritten Digit dataset 
Figure 4: Learned prototypes and subspaces for the Handwritten Digit and Recipe datasets. 


Prototype (Recipe names) Ingredients ( Subspaces ) 


basil, garlic, Italian seasoning, | oil | 
pasta pepper I salt, | tomato | 


Herbs and Tomato in Pasta 


Generic chili recipe 


beer 


chili powder cumin, gar¬ 


lic, meat, oil, onion, pepper, salt, 


tomato 


Microwave brownies 


baking powder sugar, butter, 


chocolate | chopped pecans, eggs, 


flour, salt, vanilla 


Spiced-punch 


cinnamon stick, 

lemon juice 

orange juice 


pineapple juice 


sugar, water, whole cloves 
(b) Recipe dataset 


Overall, the experiment demonstrated substantial improvement to participants’ classification accu¬ 
racy when using BCM compared with LDA, with no degradation to other objective or subjective 
measures of task performance. 

4.3 Learning subspaces 

Figure [4a|illustrates the learned prototypes and subspaces as a function of sampling iterations for the 
Handwritten Digit dataset. For the later iterations, shown on the right of the figure, the BCM output 
effectively characterizes the important aspects of the data. In particular, the subspaces learned by 
BCM are pixels that define the digit for the cluster’s prototype. 

Interestingly, the subspace highlights the absence of writing in certain areas. This makes sense: For 
example, one can define a ‘7’ by showing the absence of pixels on the left of the image where the 
loop of a ‘9’ might otherwise appear. The pixels located where there is variability among digits of 
the same cluster are not part of the defining subspace for the cluster. 

Because we initialized randomly, in early iterations, the subspaces tend to identify features common 
to the observations that were randomly initialized to the cluster. This is because uj s assigns higher 
likelihood to features with the most similar values across observations within a given cluster. For 
example, most digits ‘agree’ (i.e., have the same zero pixel value) near the borders; thus, these are 
the first areas that are refined, as shown in Figure [4a| Over iterations, the third row of Figure |4a| 
shows how BCM learns to separate the digits “3” and “5,” which tend to share many pixel values in 
similar locations. Note that the sparsity of the subspaces can be customized by hyperparameter q. 

Next, we show results for BCM using the Computer Cooking Contest dataset in Figure|4b| Each pro¬ 
totype consists of a set of ingredients for a recipe, and the subspace is a set of important ingredients 
that define that cluster, highlighted in red boxes. For instance, BCM found a “chili” cluster defined 
by the subspace “beer,” “chili powder,” and “tomato.” A recipe called “Generic Chili Recipe” was 
chosen as the prototype for the cluster. (Note that beer is indeed a typical ingredient in chili recipes.) 

5 Conclusion 

The Bayesian Case Model provides a generative framework for case-based reasoning and prototype- 
based modeling. Its clusters come with natural explanations; namely, a prototype (a quintessential 
exemplar for the cluster) and a set of defining features for that cluster. We showed the quantitative 
advantages in prediction quality and interpretability resulting from the use of BCM. Exemplar-based 
modeling (nearest-neighbors, case-based reasoning) has historical roots dating back to the beginning 
of artificial intelligence; this method offers a fresh perspective on this topic, and a new way of 
thinking about the balance of accuracy and interpretability in predictive modeling. 
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